rvest quick start guide
rvest
R Packages
Lots to choose from: XML, XML2R, scrapeR, selectr, rjson, RSelenium, etc.
Many more (and links to the above) on the Web Technologies CRAN Task View
But, we’ll be using the tidyverse packages rvest and xml2
What is the tidyverse?
“The tidyverse is a set of packages that work in harmony…. The tidyverse package is designed to make it easy to install and load core packages from the tidyverse in a single command.” - RStudio Blog
You may already have used:
- ggplot2 for visualization
- dplyr for data manipulation
- tidyr for data tidying
Install all tidyverse packages in one fell swoop:
# check if you already have it
library(tidyverse)
# if not:
install.packages("tidyverse")
library(tidyverse) # only loads the "core" tidyverse packages
tidyverse packages
- httr: for web APIs (Application Programming Interfaces)
- jsonlite: for JSON (JavaScript Object Notation) data from the web
- xml2: for XML (eXtensible Markup Language) structured data
- rvest: a package of wrapper functions around xml2 and httr for easy web scraping
We’ll focus on rvest.
rvest: What data do you want?
Find it on the web!
# character variable containing the url you want to scrape
myurl <- "http://www.imdb.com/title/tt4975722/"
“Huh? What am I doing?” - some of you right now
library(tidyverse)
library(rvest)
myhtml <- read_html(myurl)
myhtml
## {xml_document}
## <html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset= ...
## [2] <body id="styleguide-v2" class="fixed">\n<script>\n if (typeof ue ...
Need to find your data within the myhtml object.
Tags to look for:
- <p>: paragraphs
- <h1>, <h2>, etc.: headers
- <a>: links
- <li>: item in a list
- <table>: tables
Use SelectorGadget to find the exact location. (Demo)
For more on HTML, I recommend W3Schools’ tutorial.
You don’t need to be an expert in HTML to webscrape with rvest!
Tell rvest where to find your data
Copy-paste the selector from SelectorGadget, or pass HTML tag names, into html_nodes() to extract your data of interest.
myhtml %>% html_nodes(".summary_text") %>% html_text()
## [1] "\n A timeless story of human self-discovery and connection, Moonlight chronicles the life of a young black man from childhood to adulthood as he struggles to find his place in the world while growing up in a rough neighborhood of Miami.\n "
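The string passed to html_nodes() is a CSS selector: a bare name selects a tag, a leading dot selects a class, and a leading hash selects an id. Here is a minimal sketch on a made-up page, so it runs without a network connection. The toy HTML and the id "title" are assumptions for illustration and are not taken from the IMDb page.

```r
library(rvest)

# Toy page: the markup, the id "title" and the text are hypothetical
toy <- read_html('<body>
  <h1 id="title">Moonlight</h1>
  <div class="summary_text">A timeless story.</div>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</body>')

by_tag   <- toy %>% html_nodes("p") %>% html_text()             # every <p> element
by_class <- toy %>% html_nodes(".summary_text") %>% html_text() # elements with class summary_text
by_id    <- toy %>% html_nodes("#title") %>% html_text()        # the element with id title
```

SelectorGadget usually hands back a class or id selector like the last two, which is why the example above uses ".summary_text".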
myhtml %>% html_nodes("table") %>% html_table(header = TRUE)
## [[1]]
## Cast overview, first billed only: Cast overview, first billed only:
## 1 NA Mahershala Ali
## 2 NA Shariff Earp
## 3 NA Duan Sanderson
## 4 NA Alex R. Hibbert
## 5 NA Janelle Monáe
## 6 NA Naomie Harris
## 7 NA Jaden Piner
## 8 NA Herman 'Caheei McGloun
## 9 NA Kamal Ani-Bellow
## 10 NA Keomi Givens
## 11 NA Eddie Blanchard
## 12 NA Rudi Goblen
## 13 NA Ashton Sanders
## 14 NA Edson Jean
## 15 NA Patrick Decile
## Cast overview, first billed only:
## 1 ...
## 2 ...
## 3 ...
## 4 ...
## 5 ...
## 6 ...
## 7 ...
## 8 ...
## 9 ...
## 10 ...
## 11 ...
## 12 ...
## 13 ...
## 14 ...
## 15 ...
## Cast overview, first billed only:
## 1 Juan
## 2 Terrence
## 3 Azu \n \n \n (as Duan 'Sandy' Sanderson)
## 4 Little \n \n \n (as Alex Hibbert)
## 5 Teresa
## 6 Paula
## 7 Kevin (9)
## 8 Longshoreman \n \n \n (as Herman 'Caheej' McCloun)
## 9 Portable Boy 1
## 10 Portable Boy 2
## 11 Portable Boy 3
## 12 Gee
## 13 Chiron
## 14 Mr. Pierce
## 15 Terrel
##
## [[2]]
## Straight blk male friends won't see it because they think its a gay film
## 1 My problem with this movie
## 2 Most overrated movie of the Year - BORING
## 3 What happened to the bully?
## 4 Meh
## 5 Do you really think Mahershala Ali deserves the hype for this role?
## cliffcarson-502-470231
## 1 s_a-k_y
## 2 JSchoenleber70
## 3 cuterstar
## 4 jamesforsythe
## 5 jvcksonsmth
##
## [[3]]
## Amazon Affiliates
## 1 Amazon VideoWatch Movies &TV Online
## Amazon Affiliates
## 1 Prime VideoUnlimited Streamingof Movies & TV
## Amazon Affiliates
## 1 Amazon GermanyBuy Movies onDVD & Blu-ray
## Amazon Affiliates
## 1 Amazon ItalyBuy Movies onDVD & Blu-ray
## Amazon Affiliates
## 1 Amazon FranceBuy Movies onDVD & Blu-ray
## Amazon Affiliates Amazon Affiliates
## 1 Amazon IndiaBuy Movie andTV Show DVDs DPReviewDigitalPhotography
## Amazon Affiliates
## 1 AudibleDownloadAudio Books
library(stringr)
library(magrittr)
mydat <- myhtml %>%
  html_nodes("table") %>%
  extract2(1) %>%               # keep only the first table on the page
  html_table(header = TRUE)
mydat <- mydat[, c(2, 4)]       # columns 2 and 4 hold the actor and the role
names(mydat) <- c("Actor", "Role")
mydat <- mydat %>%
  mutate(Role = str_replace_all(Role, "\n ", ""))  # strip the line breaks inside Role
mydat
## Actor Role
## 1 Mahershala Ali Juan
## 2 Shariff Earp Terrence
## 3 Duan Sanderson Azu (as Duan 'Sandy' Sanderson)
## 4 Alex R. Hibbert Little (as Alex Hibbert)
## 5 Janelle Monáe Teresa
## 6 Naomie Harris Paula
## 7 Jaden Piner Kevin (9)
## 8 Herman 'Caheei McGloun Longshoreman (as Herman 'Caheej' McCloun)
## 9 Kamal Ani-Bellow Portable Boy 1
## 10 Keomi Givens Portable Boy 2
## 11 Eddie Blanchard Portable Boy 3
## 12 Rudi Goblen Gee
## 13 Ashton Sanders Chiron
## 14 Edson Jean Mr. Pierce
## 15 Patrick Decile Terrel
Using rvest, scrape a table from Wikipedia. You can pick your own table or you can get one of the tables in the country GDP per capita example from earlier.
Your result should be a data frame with one observation per row and one variable per column.
library(rvest)
library(magrittr)
# mutate() and parse_number() come from dplyr and readr, loaded earlier via library(tidyverse)
myurl <- "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(PPP)_per_capita"
myhtml <- read_html(myurl)
myhtml %>%
  html_nodes("table") %>%
  extract2(2) %>%
  html_table(header = TRUE) %>%
  mutate(`Int$` = parse_number(`Int$`)) %>%
  head()
## Rank Country Int$
## 1 1 Qatar 129727
## 2 2 Luxembourg 101936
## 3 3 Macau 96148
## 4 4 Singapore 87082
## 5 5 Brunei Darussalam 79711
## 6 6 Kuwait 71264
rvest
html_nodes
- html_nodes(x, "path") extracts all elements from the page x that have the tag / class / id path. (Use SelectorGadget to determine path.)
- html_node() does the same thing but only returns the first matching element.
myhtml %>%
  html_nodes("p") %>% # first get all the paragraphs
  html_nodes("a") # then get all the links in those paragraphs
## {xml_nodeset (22)}
## [1] <a href="/wiki/Purchasing_power_parity" title="Purchasing power par ...
## [2] <a href="/wiki/Goods_and_services" title="Goods and services">goods ...
## [3] <a href="/wiki/Gross_domestic_product" title="Gross domestic produc ...
## [4] <a href="/wiki/Per_capita" title="Per capita">per capita</a>
## [5] <a href="/wiki/International_Monetary_Fund" title="International Mo ...
## [6] <a href="/wiki/World_Bank" title="World Bank">World Bank</a>
## [7] <a href="/wiki/National_wealth" title="National wealth">national we ...
## [8] <a href="/wiki/Savings" class="mw-redirect" title="Savings">savings ...
## [9] <a href="/wiki/Cost_of_living" title="Cost of living">cost of livin ...
## [10] <a href="/wiki/List_of_countries_by_GDP_(nominal)_per_capita" title ...
## [11] <a href="https://en.wiktionary.org/wiki/generalized" class="extiw" ...
## [12] <a href="/wiki/Living_standards" class="mw-redirect" title="Living ...
## [13] <a href="/wiki/Inflation_rates" class="mw-redirect" title="Inflatio ...
## [14] <a href="/wiki/Exchange_rates" class="mw-redirect" title="Exchange ...
## [15] <a href="#cite_note-2">[2]</a>
## [16] <a href="#cite_note-3">[3]</a>
## [17] <a href="/wiki/Personal_income" title="Personal income">personal in ...
## [18] <a href="/wiki/Gross_domestic_product#Standard_of_living_and_GDP:_W ...
## [19] <a href="/wiki/Economy" title="Economy">economies</a>
## [20] <a href="/wiki/Sovereign_state" title="Sovereign state">sovereign s ...
## ...
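To make the difference between html_nodes() and html_node() concrete, here is a small sketch on an inline HTML string rather than the live Wikipedia page. The markup is invented for illustration.

```r
library(rvest)

# Hypothetical two-paragraph document with two links in the first paragraph
doc <- read_html('<body>
  <p>See <a href="/a">first</a> and <a href="/b">second</a>.</p>
  <p>No links here.</p>
</body>')

all_links  <- doc %>% html_nodes("p") %>% html_nodes("a")  # a nodeset with every match
first_link <- doc %>% html_node("a")                       # a single node: the first match only

n_links    <- length(all_links)
first_text <- html_text(first_link)
```

Chaining html_nodes() calls narrows the search step by step: the second call only looks inside the paragraphs found by the first.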
html_text
html_text(x) extracts all text from the nodeset x
myhtml %>%
  html_nodes("p") %>% # first get all the paragraphs
  html_nodes("a") %>% # then get all the links in those paragraphs
  html_text() # get the linked text only
## [1] "purchasing power parity"
## [2] "goods and services"
## [3] "gross domestic product"
## [4] "per capita"
## [5] "International Monetary Fund"
## [6] "World Bank"
## [7] "national wealth"
## [8] "savings"
## [9] "cost of living"
## [10] "List of countries by GDP (nominal) per capita"
## [11] "generalized"
## [12] "living standards"
## [13] "inflation rates"
## [14] "exchange rates"
## [15] "[2]"
## [16] "[3]"
## [17] "personal income"
## [18] "Standard of living and GDP"
## [19] "economies"
## [20] "sovereign states"
## [21] "dependent territories"
## [22] "Geary–Khamis dollars"
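html_text() returns the text of an element together with the text of everything nested inside it, and its trim argument strips leading and trailing whitespace. A minimal sketch on a made-up snippet (the markup is an assumption for illustration):

```r
library(rvest)

# Hypothetical paragraph with a link that itself contains bold text
doc <- read_html('<p>  Data from the <a href="/wiki/World_Bank">World <b>Bank</b></a>.  </p>')

link_text <- doc %>% html_node("a") %>% html_text()             # text of the link, including the <b> part
para_text <- doc %>% html_node("p") %>% html_text(trim = TRUE)  # whole paragraph, outer whitespace removed
```

Trimming would be one way to avoid the "\n " padding seen in the IMDb summary earlier.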
html_table
html_table(x, header, fill) - parse html table(s) from x into a data frame or list of data frames
myhtml %>%
  html_nodes("table") %>% # get the tables
  head(2) # look at first 2
## {xml_nodeset (2)}
## [1] <table style="font-size:95%;">\n<tr>\n<td width="33%" align="center" ...
## [2] <table class="wikitable sortable" style="margin-left:auto;margin-rig ...
myhtml %>%
  html_nodes("table") %>% # get the tables
  extract2(2) %>% # pick the second one to parse
  html_table(header = TRUE) # parse table
## Rank Country Int$
## 1 1 Qatar 129,727
## 2 2 Luxembourg 101,936
## 3 3 Macau 96,148
## 4 4 Singapore 87,082
## 5 5 Brunei Darussalam 79,711
## 6 6 Kuwait 71,264
## 7 7 Ireland 69,375
## 8 8 Norway 69,296
## 9 9 United Arab Emirates 67,696
## 10 10 Saudi Arabia 65,000
## 11 11 San Marino 64,443
## 12 12 Switzerland 59,376
## 13 13 Hong Kong 58,095
## 14 14 United States 57,294
## 15 15 Netherlands 50,846
## 16 16 Bahrain 50,303
## 17 17 Sweden 49,678
## 18 18 Australia 48,806
## 19 19 Germany 48,190
## 20 20 Iceland 48,070
## 21 21 Austria 47,856
## 22 22 Taiwan 47,790
## 23 23 Denmark 46,603
## 24 24 Canada 46,240
## 25 25 Belgium 44,881
## 26 26 Oman 43,737
## 27 27 United Kingdom 42,514
## 28 28 France 42,384
## 29 29 Finland 41,813
## 30 30 Japan 38,894
## 31 31 Equatorial Guinea 38,699
## 32 32 South Korea 37,948
## 33 33 Malta 37,891
## 34 34 Puerto Rico 37,723
## 35 35 New Zealand 37,108
## 36 36 Spain 36,451
## 37 37 Italy 36,313
## 38 38 Israel 34,834
## 39 39 Cyprus 34,387
## 40 40 Czech Republic 33,223
## 41 41 Slovenia 32,028
## 42 42 Trinidad and Tobago 31,934
## 43 43 Slovak Republic 31,182
## 44 44 Lithuania 29,882
## 45 45 Estonia 29,502
## 46 46 Portugal 28,515
## 47 47 Seychelles 28,148
## 48 48 Poland 27,715
## 49 49 Malaysia 27,234
## 50 50 Hungary 27,211
## 51 51 Greece 26,809
## 52 52 Russia 26,109
## 53 53 Latvia 25,740
## 54 54 Kazakhstan 25,669
## 55 55 St. Kitts and Nevis 25,372
## 56 56 The Bahamas 24,618
## 57 57 Antigua and Barbuda 24,050
## 58 58 Chile 23,969
## 59 59 Panama 22,788
## 60 60 Croatia 22,415
## 61 61 Romania 22,319
## 62 62 Uruguay 21,570
## 63 63 Turkey 21,147
## 64 64 Mauritius 20,525
## 65 65 Argentina 20,171
## 66 66 Bulgaria 20,116
## 67 67 Gabon 19,252
## 68 68 Mexico 18,865
## 69 69 Lebanon 18,524
## 70 70 Iran 18,136
## 71 71 Azerbaijan 17,688
## 72 72 Belarus 17,497
## 73 73 Turkmenistan 17,347
## 74 74 Barbados 17,137
## 75 75 Montenegro 17,035
## 76 76 Botswana 16,948
## 77 77 Thailand 16,835
## 78 78 Iraq 16,544
## 79 79 Costa Rica 16,142
## 80 80 Dominican Republic 15,946
## 81 81 China 15,424
## 82 82 Maldives 15,288
## 83 83 Palau 15,260
## 84 84 Brazil 15,211
## 85 85 Suriname 15,180
## 86 86 Venezuela 15,103
## 87 87 Algeria 14,950
## 88 88 Macedonia 14,530
## 89 89 Libya 14,236
## 90 90 Serbia 14,226
## 91 91 Colombia 14,162
## 92 92 Grenada 14,102
## 93 93 South Africa 13,179
## 94 94 Peru 13,019
## 95 95 Mongolia 12,161
## 96 96 Egypt 12,137
## 97 97 St. Lucia 11,970
## 98 98 Albania 11,861
## 99 99 Namibia 11,756
## 100 100 Indonesia 11,699
## 101 101 Tunisia 11,657
## 102 102 Dominica 11,484
## 103 103 St. Vincent and the Grenadines 11,267
## 104 104 Sri Lanka 11,189
## 105 105 Jordan 11,125
## 106 106 Ecuador 11,037
## 107 107 Bosnia and Herzegovina 11,034
## 108 108 Georgia 10,100
## 109 109 Swaziland 9,768
## 110 110 Paraguay 9,354
## 111 111 Fiji 9,353
## 112 112 Jamaica 8,974
## 113 113 El Salvador 8,914
## 114 114 Armenia 8,881
## 115 115 Morocco 8,360
## 116 116 Ukraine 8,230
## 117 117 Belize 8,186
## 118 118 Bhutan 8,129
## 119 119 Guatemala 7,937
## 120 120 Guyana 7,920
## 121 121 Philippines 7,696
## 122 122 Bolivia 7,191
## 123 123 Angola 6,844
## 124 124 Republic of Congo 6,787
## 125 125 Cabo Verde 6,744
## 126 126 India 6,658
## 127 127 Uzbekistan 6,453
## 128 128 Vietnam 6,422
## 129 129 Myanmar 5,953
## 130 130 Nigeria 5,930
## 131 131 Laos 5,719
## 132 132 Samoa 5,369
## 133 133 Tonga 5,332
## 134 134 Nicaragua 5,280
## 135 135 Honduras 5,264
## 136 136 Moldova 5,218
## 137 137 Pakistan 5,120
## 138 138 Sudan 4,452
## 139 139 Mauritania 4,405
## 140 140 Ghana 4,381
## 141 141 Timor-Leste 4,186
## 142 142 Zambia 3,899
## 143 143 Bangladesh 3,891
## 144 144 Cambodia 3,736
## 145 145 Côte d'Ivoire 3,581
## 146 146 Tuvalu 3,567
## 147 147 Papua New Guinea 3,542
## 148 148 Kyrgyz Republic 3,467
## 149 149 Djibouti 3,370
## 150 150 Kenya 3,360
## 151 151 São Tomé and Príncipe 3,344
## 152 152 Cameroon 3,261
## 153 153 Marshall Islands 3,240
## 154 154 Lesotho 3,107
## 155 155 Tanzania 3,097
## 156 156 Micronesia 3,033
## 157 157 Tajikistan 2,982
## 158 158 Vanuatu 2,631
## 159 159 Chad 2,580
## 160 160 Senegal 2,578
## 161 161 Yemen 2,521
## 162 162 Nepal 2,481
## 163 163 Mali 2,265
## 164 164 Benin 2,185
## 165 165 Uganda 2,067
## 166 166 Solomon Islands 1,996
## 167 167 Afghanistan 1,957
## 168 168 Zimbabwe 1,953
## 169 169 Ethiopia 1,916
## 170 170 Rwanda 1,905
## 171 171 Kiribati 1,821
## 172 172 Burkina Faso 1,791
## 173 173 Haiti 1,784
## 174 174 South Sudan 1,671
## 175 175 The Gambia 1,664
## 176 176 Sierra Leone 1,652
## 177 177 Guinea-Bissau 1,568
## 178 178 Togo 1,546
## 179 179 Comoros 1,529
## 180 180 Madagascar 1,505
## 181 181 Eritrea 1,322
## 182 182 Guinea 1,271
## 183 183 Mozambique 1,228
## 184 184 Malawi 1,139
## 185 185 Niger 1,114
## 186 186 Liberia 882
## 187 187 Burundi 818
## 188 188 Democratic Republic of the Congo 785
## 189 189 Central African Republic 656
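Note that the Int$ column above still contains thousands separators, so it is read as text. The sketch below parses a three-column toy table and converts that column to numbers with base R; the earlier solution used readr's parse_number() for the same step. The inline table is a hand-typed stand-in for the Wikipedia one.

```r
library(rvest)

# Hand-typed stand-in for the first rows of the GDP table
toy <- read_html('<table>
  <tr><th>Rank</th><th>Country</th><th>Int$</th></tr>
  <tr><td>1</td><td>Qatar</td><td>129,727</td></tr>
  <tr><td>2</td><td>Luxembourg</td><td>101,936</td></tr>
</table>')

gdp <- toy %>%
  html_node("table") %>%      # a single table node, so html_table() returns one data frame
  html_table(header = TRUE)

# "129,727" is text; remove the commas before converting to numeric
gdp[["Int$"]] <- as.numeric(gsub(",", "", gdp[["Int$"]]))
```

Passing a nodeset of several tables to html_table() gives a list of data frames instead, which is why the slides pick one table with extract2() first.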
html_attrs
- html_attrs(x) - extracts all attributes from a nodeset x
- html_attr(x, name) - extracts the name attribute from all elements in nodeset x
- Attributes include href, title, class, style, etc.
myhtml %>%
  html_nodes("table") %>% extract2(2) %>%
  html_attrs()
## class
## "wikitable sortable"
## style
## "margin-left:auto;margin-right:auto;text-align: right"
myhtml %>%
  html_nodes("p") %>% html_nodes("a") %>%
  html_attr("href")
## [1] "/wiki/Purchasing_power_parity"
## [2] "/wiki/Goods_and_services"
## [3] "/wiki/Gross_domestic_product"
## [4] "/wiki/Per_capita"
## [5] "/wiki/International_Monetary_Fund"
## [6] "/wiki/World_Bank"
## [7] "/wiki/National_wealth"
## [8] "/wiki/Savings"
## [9] "/wiki/Cost_of_living"
## [10] "/wiki/List_of_countries_by_GDP_(nominal)_per_capita"
## [11] "https://en.wiktionary.org/wiki/generalized"
## [12] "/wiki/Living_standards"
## [13] "/wiki/Inflation_rates"
## [14] "/wiki/Exchange_rates"
## [15] "#cite_note-2"
## [16] "#cite_note-3"
## [17] "/wiki/Personal_income"
## [18] "/wiki/Gross_domestic_product#Standard_of_living_and_GDP:_Wealth_distribution_and_externalities"
## [19] "/wiki/Economy"
## [20] "/wiki/Sovereign_state"
## [21] "/wiki/Dependent_territories"
## [22] "/wiki/Geary%E2%80%93Khamis_dollar"
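Most of the href values above are relative ("/wiki/...") or point within the same page ("#cite_note-2"), so they cannot be passed to read_html() as they are. One possible way to turn them into full URLs is xml2's url_absolute(); the sketch below assumes a base URL of "https://en.wikipedia.org/" and uses a made-up snippet with three of the link types seen above.

```r
library(rvest)
library(xml2)

# Assumed base URL for resolving relative links
base_url <- "https://en.wikipedia.org/"

doc <- read_html('<p>
  <a href="/wiki/World_Bank">World Bank</a>
  <a href="#cite_note-2">[2]</a>
  <a href="https://en.wiktionary.org/wiki/generalized">generalized</a>
</p>')

hrefs <- doc %>% html_nodes("a") %>% html_attr("href")
hrefs <- hrefs[!startsWith(hrefs, "#")]     # drop same-page anchors
full_urls <- url_absolute(hrefs, base_url)  # relative links get the base; absolute ones are unchanged
```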
- html_children - list the “children” of the HTML page. Can be chained like html_nodes
- html_name - gives the tags of a nodeset. Use in a chain with html_children
myhtml %>%
  html_children() %>%
  html_name()
## [1] "head" "body"
- html_form - parses HTML forms (checkboxes, fill-in-the-blanks, etc.)
- html_session - simulate a session in an html browser; use the functions jump_to, back to navigate through the page
Find another website you want to scrape (ideas: all bills in the House so far this year, video game reviews, anything on Wikipedia) and use at least 3 different rvest functions in a chain to extract some data.
url <- "http://avalon.law.yale.edu/subject_menus/inaug.asp"
# even though it's called "all inaugs" some are missing
all_inaugs <- url %>%
  read_html() %>%   # url is already piped in as the first argument
  html_nodes("table") %>%
  html_table(fill = TRUE, header = TRUE) %>%
  extract2(3)       # keep the third table
# table of addresses
all_inaugs_tidy <- all_inaugs %>%
  gather(term, year, -President) %>%
  filter(!is.na(year)) %>%
  select(-term) %>%
  arrange(year)
head(all_inaugs_tidy)
## President year
## 1 George Washington 1789
## 2 George Washington 1793
## 3 John Adams 1797
## 4 Thomas Jefferson 1801
## 5 Thomas Jefferson 1805
## 6 James Madison 1809
# get the links to the addresses
inaugadds_adds <- (url %>%
read_html() %>%
html_nodes("a") %>%
html_attr("href"))[12:66]
# create the urls to scrape
urlstump <- "http://avalon.law.yale.edu/"
# fixed() matches "../" literally; as a regex, each "." would match any character
inaugurls <- paste0(urlstump, str_replace(inaugadds_adds, fixed("../"), ""))
all_inaugs_tidy$url <- inaugurls
head(all_inaugs_tidy)
## President year
## 1 George Washington 1789
## 2 George Washington 1793
## 3 John Adams 1797
## 4 Thomas Jefferson 1801
## 5 Thomas Jefferson 1805
## 6 James Madison 1809
## url
## 1 http://avalon.law.yale.edu/18th_century/wash1.asp
## 2 http://avalon.law.yale.edu/18th_century/wash2.asp
## 3 http://avalon.law.yale.edu/18th_century/adams.asp
## 4 http://avalon.law.yale.edu/19th_century/jefinau1.asp
## 5 http://avalon.law.yale.edu/19th_century/jefinau2.asp
## 6 http://avalon.law.yale.edu/19th_century/madison1.asp
get_inaugurations <- function(url){
  # try to read the page; return NA if the request fails
  page <- try(read_html(url), silent = TRUE)
  if (inherits(page, "try-error")) {
    return(NA)
  }
  # reuse the page already downloaded instead of requesting it a second time
  page %>%
    html_nodes("p") %>%
    html_text()
}
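The same "return NA when a page cannot be read" idea can also be written with tryCatch(). The sketch below is an alternative under stated assumptions, not the function used in these slides: safe_paragraphs() is a hypothetical name, and the two inputs (an inline HTML string and a file name that does not exist) are stand-ins so the example runs offline.

```r
library(rvest)

# Hypothetical helper: paragraphs of a page, or NA if reading fails
safe_paragraphs <- function(src) {
  tryCatch(
    read_html(src) %>% html_nodes("p") %>% html_text(),
    error = function(e) NA_character_   # any read or parse error becomes NA
  )
}

ok  <- safe_paragraphs("<p>Fellow-Citizens</p><p>Among the vicissitudes</p>")
bad <- safe_paragraphs("no_such_file_for_this_sketch.html")  # file does not exist, so read_html() errors
```

Returning NA instead of stopping lets map() finish the whole list of URLs, and the failures can be found afterwards with is.na().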
# takes about 30 secs to run
all_inaugs_text <- all_inaugs_tidy %>%
  mutate(address_text = map(url, get_inaugurations))
all_inaugs_text$address_text[[1]]
## [1] " Fellow-Citizens of the Senate and of the House of Representatives: "
## [2] "Among the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order, and received on the 14th day of the present month. On the one hand, I was summoned by my Country, whose voice I can never hear but with veneration and love, from a retreat which I had chosen with the fondest predilection, and, in my flattering hopes, with an immutable decision, as the asylum of my declining years--a retreat which was rendered every day more necessary as well as more dear to me by the addition of habit to inclination, and of frequent interruptions in my health to the gradual waste committed on it by time. On the other hand, the magnitude and difficulty of the trust to which the voice of my country called me, being sufficient to awaken in the wisest and most experienced of her citizens a distrustful scrutiny into his qualifications, could not but overwhelm with despondence one who (inheriting inferior endowments from nature and unpracticed in the duties of civil administration) ought to be peculiarly conscious of his own deficiencies. In this conflict of emotions all I dare aver is that it has been my faithful study to collect my duty from a just appreciation of every circumstance by which it might be affected. All I dare hope is that if, in executing this task, I have been too much swayed by a grateful remembrance of former instances, or by an affectionate sensibility to this transcendent proof of the confidence of my fellow-citizens, and have thence too little consulted my incapacity as well as disinclination for the weighty and untried cares before me, my error will be palliated by the motives which mislead me, and its consequences be judged by my country with some share of the partiality in which they originated. "
## [3] "Such being the impressions under which I have, in obedience to the public summons, repaired to the present station, it would be peculiarly improper to omit in this first official act my fervent supplications to that Almighty Being who rules over the universe, who presides in the councils of nations, and whose providential aids can supply every human defect, that His benediction may consecrate to the liberties and happiness of the people of the United States a Government instituted by themselves for these essential purposes, and may enable every instrument employed in its administration to execute with success the functions allotted to his charge. In tendering this homage to the Great Author of every public and private good, I assure myself that it expresses your sentiments not less than my own, nor those of my fellow- citizens at large less than either. No people can be bound to acknowledge and adore the Invisible Hand which conducts the affairs of men more than those of the United States. Every step by which they have advanced to the character of an independent nation seems to have been distinguished by some token of providential agency; and in the important revolution just accomplished in the system of their united government the tranquil deliberations and voluntary consent of so many distinct communities from which the event has resulted can not be compared with the means by which most governments have been established without some return of pious gratitude, along with an humble anticipation of the future blessings which the past seem to presage. These reflections, arising out of the present crisis, have forced themselves too strongly on my mind to be suppressed. You will join with me, I trust, in thinking that there are none under the influence of which the proceedings of a new and free government can more auspiciously commence. "
## [4] "By the article establishing the executive department it is made the duty of the President \"to recommend to your consideration such measures as he shall judge necessary and expedient.\" The circumstances under which I now meet you will acquit me from entering into that subject further than to refer to the great constitutional charter under which you are assembled, and which, in defining your powers, designates the objects to which your attention is to be given. It will be more consistent with those circumstances, and far more congenial with the feelings which actuate me, to substitute, in place of a recommendation of particular measures, the tribute that is due to the talents, the rectitude, and the patriotism which adorn the characters selected to devise and adopt them. In these honorable qualifications I behold the surest pledges that as on one side no local prejudices or attachments, no separate views nor party animosities, will misdirect the comprehensive and equal eye which ought to watch over this great assemblage of communities and interests, so, on another, that the foundation of our national policy will be laid in the pure and immutable principles of private morality, and the preeminence of free government be exemplified by all the attributes which can win the affections of its citizens and command the respect of the world. 
I dwell on this prospect with every satisfaction which an ardent love for my country can inspire, since there is no truth more thoroughly established than that there exists in the economy and course of nature an indissoluble union between virtue and happiness; between duty and advantage; between the genuine maxims of an honest and magnanimous policy and the solid rewards of public prosperity and felicity; since we ought to be no less persuaded that the propitious smiles of Heaven can never be expected on a nation that disregards the eternal rules of order and right which Heaven itself has ordained; and since the preservation of the sacred fire of liberty and the destiny of the republican model of government are justly considered, perhaps, as deeply, as finally, staked on the experiment entrusted to the hands of the American people. "
## [5] "Besides the ordinary objects submitted to your care, it will remain with your judgment to decide how far an exercise of the occasional power delegated by the fifth article of the Constitution is rendered expedient at the present juncture by the nature of objections which have been urged against the system, or by the degree of inquietude which has given birth to them. Instead of undertaking particular recommendations on this subject, in which I could be guided by no lights derived from official opportunities, I shall again give way to my entire confidence in your discernment and pursuit of the public good; for I assure myself that whilst you carefully avoid every alteration which might endanger the benefits of an united and effective government, or which ought to await the future lessons of experience, a reverence for the characteristic rights of freemen and a regard for the public harmony will sufficiently influence your deliberations on the question how far the former can be impregnably fortified or the latter be safely and advantageously promoted. "
## [6] "To the foregoing observations I have one to add, which will be most properly addressed to the House of Representatives. It concerns myself, and will therefore be as brief as possible. When I was first honored with a call into the service of my country, then on the eve of an arduous struggle for its liberties, the light in which I contemplated my duty required that I should renounce every pecuniary compensation. From this resolution I have in no instance departed; and being still under the impressions which produced it, I must decline as inapplicable to myself any share in the personal emoluments which may be indispensably included in a permanent provision for the executive department, and must accordingly pray that the pecuniary estimates for the station in which I am placed may during my continuance in it be limited to such actual expenditures as the public good may be thought to require. "
## [7] "Having thus imparted to you my sentiments as they have been awakened by the occasion which brings us together, I shall take my present leave; but not without resorting once more to the benign Parent of the Human Race in humble supplication that, since He has been pleased to favor the American people with opportunities for deliberating in perfect tranquillity, and dispositions for deciding with unparalleled unanimity on a form of government for the security of their union and the advancement of their happiness, so His divine blessing may be equally conspicuous in the enlarged views, the temperate consultations, and the wise measures on which the success of this Government must depend. "
all_inaugs_text$President[is.na(all_inaugs_text$address_text)]
## [1] "Martin Van Buren" "James Buchanan" "James A. Garfield"
## [4] "Calvin Coolidge"
# there are 7 missing at this point: Obama's two and Trump's, plus Coolidge, Garfield, Buchanan, and Van Buren, which errored in the scraping.
obama09 <- get_inaugurations("http://avalon.law.yale.edu/21st_century/obama.asp")
obama13 <- readLines("speeches/obama2013.txt")
trump17 <- readLines("speeches/trumpinaug.txt")
vanburen1837 <- readLines("speeches/vanburen1837.txt") # row 13
buchanan1857 <- readLines("speeches/buchanan1857.txt") # row 18
garfield1881 <- readLines("speeches/garfield1881.txt") # row 24
coolidge1925 <- readLines("speeches/coolidge1925.txt") # row 35
all_inaugs_text$address_text[c(13,18,24,35)] <- list(vanburen1837,buchanan1857, garfield1881, coolidge1925)
# let's combine them all now
recents <- data.frame(President = c(rep("Barack Obama", 2),
"Donald Trump"),
year = c(2009, 2013, 2017),
url = NA,
address_text = NA)
all_inaugs_text <- rbind(all_inaugs_text, recents)
all_inaugs_text$address_text[c(56:58)] <- list(obama09, obama13, trump17)
Now, I use the tidytext package to get the words out of each inaugural address.
# install.packages("tidytext")
library(tidytext)
all_inaugs_text %>%
  select(-url) %>%
  unnest(address_text) %>% # one row per paragraph; newer tidyr versions expect the column to be named
  unnest_tokens(word, address_text) -> presidential_words
head(presidential_words)
## President year word
## 1 George Washington 1789 fellow
## 1.1 George Washington 1789 citizens
## 1.2 George Washington 1789 of
## 1.3 George Washington 1789 the
## 1.4 George Washington 1789 senate
## 1.5 George Washington 1789 and
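As the output shows, unnest_tokens() lowercases the text, drops punctuation, and returns one word per row. A self-contained sketch of that step and of the word count that follows, on two invented rows ("Example Speaker" and both text snippets are placeholders, not scraped data):

```r
library(dplyr)
library(tidytext)

# Two made-up rows standing in for the scraped addresses
toy <- data.frame(
  President    = c("George Washington", "Example Speaker"),
  address_text = c("Fellow-Citizens of the Senate", "We the people"),
  stringsAsFactors = FALSE
)

toy_words <- toy %>%
  unnest_tokens(word, address_text)      # one lowercase word per row

word_totals <- toy_words %>%
  count(President, name = "num_words")   # same result as group_by() + summarize(n())
```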
presidential_words %>%
  group_by(President, year) %>%
  summarize(num_words = n()) %>%
  arrange(desc(num_words)) -> presidential_wordtotals
First, get all the URLs for the Wikipedia articles for the years 1987-2016.
years <- 1987:2016
urls <- paste0("https://en.wikipedia.org/wiki/", years, "#Deaths")
Next, create a data frame to store all of the data.
celebDeaths <- data.frame(year = years, url = urls,
                          stringsAsFactors = FALSE)
urls[1] %>% read_html() %>% html_children() %>%
  html_nodes("h2")
## {xml_nodeset (8)}
## [1] <h2>Contents</h2>
## [2] <h2>\n<span class="mw-headline" id="Events">Events</span><span class ...
## [3] <h2>\n<span class="mw-headline" id="Births">Births</span><span class ...
## [4] <h2>\n<span class="mw-headline" id="Deaths">Deaths</span><span class ...
## [5] <h2>\n<span class="mw-headline" id="In_fiction">In fiction</span><sp ...
## [6] <h2>\n<span class="mw-headline" id="Nobel_Prizes">Nobel Prizes</span ...
## [7] <h2>\n<span class="mw-headline" id="References">References</span><sp ...
## [8] <h2>Navigation menu</h2>
urls[1] %>% read_html() %>% html_children() %>%
  html_nodes("li")
## {xml_nodeset (1361)}
## [1] <li><a href="/wiki/19th_century" title="19th century">19th century< ...
## [2] <li><b><a href="/wiki/20th_century" title="20th century">20th centu ...
## [3] <li><a href="/wiki/21st_century" title="21st century">21st century< ...
## [4] <li><a href="/wiki/1960s" title="1960s">1960s</a></li>
## [5] <li><a href="/wiki/1970s" title="1970s">1970s</a></li>
## [6] <li><b><a href="/wiki/1980s" title="1980s">1980s</a></b></li>
## [7] <li><a href="/wiki/1990s" title="1990s">1990s</a></li>
## [8] <li><a href="/wiki/2000s_(decade)" title="2000s (decade)">2000s</a> ...
## [9] <li><a href="/wiki/1984" title="1984">1984</a></li>
## [10] <li><a href="/wiki/1985" title="1985">1985</a></li>
## [11] <li><a href="/wiki/1986" title="1986">1986</a></li>
## [12] <li><b><strong class="selflink">1987</strong></b></li>
## [13] <li><a href="/wiki/1988" title="1988">1988</a></li>
## [14] <li><a href="/wiki/1989" title="1989">1989</a></li>
## [15] <li><a href="/wiki/1990" title="1990">1990</a></li>
## [16] <li><a href="/wiki/1987_in_archaeology" title="1987 in archaeology" ...
## [17] <li><a href="/wiki/1987_in_architecture" title="1987 in architectur ...
## [18] <li><a href="/wiki/1987_in_art" title="1987 in art">Art</a></li>
## [19] <li><a href="/wiki/1987_in_aviation" title="1987 in aviation">Aviat ...
## [20] <li><a href="/wiki/Category:1987_awards" title="Category:1987 award ...
## ...
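The list items for every section come back mixed together, so the deaths have to be cut out by position: find the "Deaths" heading, find the next heading, and keep what lies between them. Here is a simplified sketch of that idea on a toy page. The markup, the "#content" id, and the placeholder names are assumptions; the real Wikipedia structure is messier, so the function in these slides uses slightly different indexing.

```r
library(rvest)

# Toy page with three sections; only the middle one is wanted
page <- read_html('<div id="content">
  <h2><span id="Births">Births</span></h2>
  <ul><li>January 1: Person A</li></ul>
  <h2><span id="Deaths">Deaths</span></h2>
  <ul><li>January 6: Person B</li><li>January 9: Person C</li></ul>
  <h2><span id="References">References</span></h2>
  <ul><li>Some reference</li></ul>
</div>')

kids <- page %>% html_node("#content") %>% html_children()  # top-level elements, in page order
h2s  <- which(html_name(kids) == "h2")                      # positions of the section headers
ids  <- kids[h2s] %>% html_node("span") %>% html_attr("id") # id of each header

start <- h2s[which(ids == "Deaths")]       # position of the Deaths header
end   <- h2s[which(ids == "Deaths") + 1]   # position of the next header

deaths <- kids[(start + 1):(end - 1)] %>%  # everything between the two headers
  html_nodes("li") %>%
  html_text()
```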
get_deaths <- function(url){
# get the main content page
page <- url %>% read_html() %>%
html_nodes("#mw-content-text") %>% html_children()
# get the names of all elements
tagnames <- page %>% html_name()
# where are the big section headers
h2s <- which(tagnames == "h2")
# to find the heading labeled "Deaths"
h2childids <- page[h2s] %>% html_children() %>% html_attr("id")
idDeaths <- which(h2childids == "Deaths")
# list of deaths starts after the location of deathStart and
# ends immediately before the location of deathEnd (next big header)
deathStart <- h2s[(idDeaths+1)/2]
deathEnd <- h2s[(idDeaths+1)/2+1]
# get the deaths
death_elements <- page[(deathStart+1):(deathEnd-1)]
deaths <- death_elements %>% html_nodes("li") %>% html_text()
# there are two types of deaths: there was only one death that day in that year (a)
deathsa <- data.frame(death = deaths[grep("–", deaths)])
deathsa <- deathsa %>%
separate(death, into = c("Date", "Person"), sep = " – ") %>%
separate(Date, into = c("Month", "Day"), sep = " ") %>%
separate(Person, into = c("Name", "Desc"), sep = ", ", extra = "merge")
# or there were multiple deaths that day in that year (b)
deathsb <- data.frame(death = deaths[!grepl("–", deaths)], stringsAsFactors = FALSE) # !grepl() is safe even when nothing matches
# remove repeats
deathsb <- data.frame(death = deathsb[grep("\n",deathsb$death),], stringsAsFactors = F)
# tidy up the data
deathsb %>%
separate(death, into = c("Date", "Other"), sep = "\\n", extra="merge") %>%
separate(Other, into = paste0("Person", 1:6), sep = "\\n", fill = "right") %>%
gather(Person, Desc, -Date) %>%
select(Date, Desc) %>%
filter(!is.na(Desc)) -> deathsb
deathsb %>% separate(Desc, into = c("Name", "Desc"), sep = ", ", extra = "merge") %>%
separate(Date, into = c("Month", "Day"), sep = " ") %>%
filter(!is.na(Desc)) -> deathsb
#combine the 2 sets
deaths <- rbind(deathsa, deathsb)
return(deaths)
}
# should take about 10 seconds
celebDeaths <- celebDeaths %>%
  mutate(Deaths = map(url, get_deaths)) %>%
  unnest(Deaths) # newer tidyr versions expect the list column to be named
head(celebDeaths[,-2])
## year Month Day Name
## 1 1987 January 6 Harry D. Payne
## 2 1987 January 9 Arthur Lake
## 3 1987 January 10 Hakan Malmrot
## 4 1987 January 14 Douglas Sirk
## 5 1987 January 15 Ray Bolger
## 6 1987 January 19 Gerald Brenan
## Desc
## 1 American architect (b. 1891)
## 2 American actor, Dagwood Bumstead in Blondie (b. 1905)
## 3 Swedish swimmer (b. 1900)
## 4 German-born film director, Hollywood melodramas Magnificent Obsession, All That Heaven Allows, Written on the Wind, Imitation of Life (b. 1897)
## 5 American actor, singer, and dancer. Scarecrow in The Wizard of Oz (b. 1904)
## 6 British writer and Hispanist (b. 1894)
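The Month, Day, Name and Desc columns come from three successive separate() calls. The sketch below replays those calls on two entries retyped from the output above, so the splitting can be checked without scraping; "\u2013" is the en dash used on the Wikipedia pages.

```r
library(tidyr)

# Two entries retyped from the output above
raw <- data.frame(
  death = c("January 6 \u2013 Harry D. Payne, American architect (b. 1891)",
            "January 15 \u2013 Ray Bolger, American actor, singer, and dancer (b. 1904)"),
  stringsAsFactors = FALSE
)

tidy <- raw %>%
  separate(death, into = c("Date", "Person"), sep = " \u2013 ") %>%         # date | person
  separate(Date, into = c("Month", "Day"), sep = " ") %>%                   # month | day
  separate(Person, into = c("Name", "Desc"), sep = ", ", extra = "merge")   # split at the first comma only
```

extra = "merge" matters for the second row: without it, everything after the second comma in the description would be dropped with a warning.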
celebDeaths %>%
  group_by(year) %>%
  summarise(num_deaths = n()) %>%
  arrange(desc(num_deaths)) %>%
  head(10)
## # A tibble: 10 × 2
## year num_deaths
## <int> <int>
## 1 2016 358
## 2 1993 314
## 3 2015 309
## 4 1990 305
## 5 1991 294
## 6 1992 286
## 7 1989 266
## 8 1996 265
## 9 1995 249
## 10 1998 248